Arabizi Identification in Twitter Data
Abstract
In this work we explore some challenges related to analysing one form of the Arabic language called Arabizi. Arabizi, a portmanteau of Araby-Englizi, meaning Arabic-English, is a digital trend in texting Non-Standard Arabic using Latin script. Arabizi users express their natural dialectal Arabic in text without following a unified orthography. We address the challenge of identifying Arabizi in multilingual Twitter data, a preliminary step for analysing sentiment in Arabizi data. We annotated a corpus of Twitter data streamed across two Arab countries, extracted linguistic features and trained a classifier achieving an average Arabizi identification accuracy of 94.5%. We also present the percentage of Arabizi usage on Twitter across both countries, providing important insights for researchers in NLP and sociolinguistics.
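As a rough illustration of what feature extraction for Arabizi identification might look like, the sketch below computes two simple surface features for a tweet. It is not the feature set or classifier used in this work, which the abstract does not specify. The function names, the choice of features, and the digit set `23579` (numerals commonly used in Arabizi to stand in for Arabic letters) are all assumptions made for this example.

```python
import re

# Digits often used in Arabizi in place of Arabic letters (e.g. 3, 7).
# The exact set is an illustrative assumption, not taken from the paper.
ARABIZI_DIGITS = set("23579")

# A token is a run of letters and digits in any script.
TOKEN_RE = re.compile(r"[^\W_]+")


def is_latin_word(token):
    """True if the token has letters and all of them are ASCII Latin."""
    letters = [c for c in token if c.isalpha()]
    return bool(letters) and all(c.isascii() for c in letters)


def has_digit_letter(token):
    """True if a Latin-script token also contains an Arabizi digit."""
    return is_latin_word(token) and any(c in ARABIZI_DIGITS for c in token)


def extract_features(tweet):
    """Return hypothetical surface features for one tweet."""
    tokens = TOKEN_RE.findall(tweet)
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "latin_ratio": sum(is_latin_word(t) for t in tokens) / n,
        "digit_letter_ratio": sum(has_digit_letter(t) for t in tokens) / n,
    }
```

Calling `extract_features("ana ba7ebak ya 3omri")` yields a Latin ratio of 1.0 and a digit-letter ratio of 0.5, whereas a tweet in Arabic script scores 0.0 on both. Features like these would still need to be combined with lexical evidence to separate Arabizi from English or French, which share the Latin script.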
Similar resources
On Cross-Script Information Retrieval
We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic script queries to retrieve Arabic tweets written in a Roman script known as Arabizi. For example, a query for “كتاب” would not match “kitab” even though an Arabic reader would see them as the same word. Moreover, because of the lack of ...
Arabizi Detection and Conversion to Arabic
Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address the problems of identifying Arabizi in text and converting it to Arabic characters. We used word and sequence-level ...
A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models
Opinion mining in Arabic is a challenging task given the rich morphology of the language. The task becomes more challenging when it is applied to Twitter data, which contains additional sources of noise, such as the use of unstandardized dialectal variations, nonconformity to grammatical rules, the use of Arabizi and code-switching, and the use of non-text objects such as images and URLs ...
A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation
A major challenge for statistical machine translation (SMT) of Arabic-to-English user-generated text is the prevalence of text written in Arabizi, or Romanized Arabic. When facing such texts, a translation system trained on conventional Arabic-English data will suffer from extremely low model coverage. In addition, Arabizi is not regulated by any official standardization and therefore highly am...
Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and nonstandard abbreviations are common; and nonlinguistic co...